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Abstract 

This article presents the ability of an omnibus pernnutation test on ensembles of two-locus analyses (2L0mb) to 
detect pure epistasis in the presence of genetic heterogeneity. The performance of 2L0mb is evaluated in various 
simulation scenarios covering two independent causes of complex disease where each cause is governed by a purely 
epistatic interaction. Different scenarios are set up by varying the number of available single nucleotide 
polymorphisms (SNPs) in data, number of causative SNPs and ratio of case samples from two affected groups. The 
simulation results indicate that 2L0mb outperforms multifactor dimensionality reduction (MDR) and random forest 
(RF) techniques in terms of a low number of output SNPs and a high number of correctly-identified causative SNPs. 
Moreover, 2L0mb is capable of identifying the number of independent interactions in tractable computational time 
and can be used in genome-wide association studies. 2L0mb is subsequently applied to a type 1 diabetes mellitus 
(Tl D) data set, which is collected from a UK population by the Wellcome Trust Case Control Consortium (WTCCC). 
After screening for SNPs that locate within or near genes and exhibit no marginal single-locus effects, the Tl D data set 
is reduced to 95,991 SNPs from 1 2,146 genes. The 2L0mb search in the reduced Tl D data set reveals that 1 2 SNPs, 
which can be divided into two independent sets, are associated with the disease. The first SNP set consists of three 
SNPs from MUC21 (mucin 21, cell surface associated), three SNPs from MUC22 (mucin 22), two SNPs from PSORSICI 
(psoriasis susceptibility 1 candidate 1) and one SNP from TCF19 (transcription factor 19). A four-locus interaction 
between these four genes is also detected. The second SNP set consists of three SNPs from ATADl (ATPase family, 
AAA domain containing 1). Overall, the findings indicate the detection of pure epistasis in the presence of genetic 
heterogeneity and provide an alternative explanation for the aetiology of Tl D in the UK population. 

Keywords: Attribute selection. Complex disease, Epistasis, Genetic heterogeneity. Genome-wide association study. 
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^Correspondence: n.chaiyaratana(5)gmail.com 

^ Department of Electrical and Computer Engineering, Faculty of Engineering, 
King Mongkut's University of Technology North Bangkok,! 51 8 Pracharat Sai 1 
Road, Bangsue, Bangkok 10800, Thailand 

"^Division of Molecular Genetics, Department of Research and Development, 
Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Prannok Road, 
Bangkoknoi, Bangkok 10700, Thailand 

Full list of author information is available at the end of the article 



Spri 



inger 



© 201 3 Setsirichok et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons 
Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction 
in any medium, provided the original work is properly cited. 



Setsirichok etal. SpringerPlus 201 3, 2:230 
http://www.springerplus.eom/content/2/1/230 



Page 2 of 1 7 



Background 

Epistasis or gene-gene interactions are among many 
causes of complex diseases (Moore 2005). In the sim- 
plest form, epistasis can be described by two-locus disease 
models, in which both loci jointly contribute towards 
the disease susceptibility (Neuman and Rice 1992; Schork 
et al 1993). Many attempts have been made to pro- 
vide consistent definitions of epistasis (Cordell 2002; 
Hallgrimsdottir and Yuster 2008; Li and Reich 2000; 
Marchini et al. 2005; Musani et al. 2007; Verhoeven et 
al. 2010). Regardless of preferred definitions, a common 
ground for describing epistasis covers an effect deviat- 
ing from the combined individual effects of each genetic 
factor. In other words, epistasis describes an effect that 
departs from a linear addition of individual effects (Fisher 
1918). The detection of epistasis hence provides neces- 
sary information complementary to that gained through 
single-locus analysis. 

With the availability of genome-wide genotyping tech- 
nologies, a large number of single nucleotide poly- 
morphisms (SNPs) can be considered during epistasis 
detection (Heidema et al. 2006; Motsinger et al. 2007; 
Van Steen 2012). At present, the most feasible strategy 
for genome-wide epistasis detection involves two-locus 
analysis (Evans et al. 2006; Gayan et al. 2008; lonita 
and Man 2006; Liu et al. 2011; Marchini et al. 2005; 
Sha et al. 2009; Wongseree et al. 2009). The detection 
may concentrate on all possible SNP pairs (Gayan et al. 
2008; Liu et al. 2011; Marchini et al. 2005; Wongseree 
et al. 2009) or only SNP pairs where at least one 
SNP in each pair exhibits a marginal single-locus effect 
(Evans et al. 2006; lonita and Man 2006; Liu et al. 2011; 
Marchini et al. 2005; Sha et al. 2009). Exhaustive two- 
locus analysis is generally required when pure epistasis 
(Culverhouse et al. 2002) is present. This is because each 
interacting SNP in a purely epistatic model exhibits no 
marginal single-locus effect. Although the importance of 
pure epistasis remains in question (Cordell 2009), many 
genetic association studies reveal that putatively pure 
epistasis plays a role in determining disease susceptibil- 
ity (Cho et al. 2004; Jiang and Neapolitan 2012; Zhang 
et al. 2008). 

In addition to epistasis, two-locus (Hallgrimsdottir 
and Yuster 2008; Li and Reich 2000; Neuman and Rice 
1992; Schork et al. 1993) and multi-locus disease mod- 
els (Edwards et al. 2009; Lunetta et al. 2004) also 
describe other phenomena. One particular phenomenon 
that makes the capture of genetic factors responsible for 
complex diseases a difficult task is genetic heterogeneity. 
Basically, genetic heterogeneity models define indepen- 
dent effects that cause the same complex disease. Since 
it is impossible to know beforehand that each affected 
individual participating in genetic association studies is 
predisposed to which independent effect, the presence 



of genetic heterogeneity always leads to the reduction in 
statistical power to detect causative SNPs (Edwards et al. 
2009; Lunetta et al. 2004; Meng et al. 2009; Ritchie et al. 
2007; Ritchie et al. 2003). 

From a machine learning viewpoint, the identification 
of causative SNPs among available SNPs in genetic asso- 
ciation studies can be treated as an attribute selection 
problem. The aim of attribute selection is to identify infor- 
mative attributes necessary for the correct classification of 
recruited samples. Saeys et al. (2007) categorise attribute 
selection techniques into three main approaches: filter, 
wrapper and embedded approaches. The filter approach 
interests in identifying SNPs associated with the disease 
according to a statistical or mathematical measure. The 
wrapper approach attempts to search for the best SNP 
combination that provides the highest prediction accu- 
racy dictated by a classifier. The embedded approach 
uses available SNPs to construct a prediction model while 
simultaneously prioritises informative SNPs. 

Among the wrapper techniques, a technique which 
is proven to be capable of detecting pure epistasis in 
the presence of genetic heterogeneity is a multifactor 
dimensionality reduction (MDR) technique (Edwards et 
al. 2009; Ritchie et al. 2007; Ritchie et al. 2003). MDR 
searches for the best SNP combination that yields the 
highest prediction accuracy according to the rules gov- 
erned by multi-dimensional decision tables (Ritchie et al. 
2001). Although the detection power of MDR is high, the 
demonstration has been limited to simulations consisting 
of two independent purely epistatic two-locus interac- 
tions. Moreover, MDR is a time-consuming technique 
and hence requires large computational efforts for multi- 
locus analysis in genetic association studies with a large 
number of SNPs (Edwards et al. 2009; Kwon et al. 2012; 
Pattin and Moore 2008; Ritchie et al. 2001; Wongseree 
et al. 2009). 

Similar to MDR, a random forest (RF) is an embed- 
ded technique which is also proven to be capable of 
detecting epistasis in the presence of genetic heterogene- 
ity (Lunetta et al. 2004; Meng et al. 2009). RF consists 
of multiple decision trees in which each tree is ran- 
domly constructed from available SNPs. Causative SNPs 
can be identified by permuting the genotype of each 
SNP and observing how this affects the overall predic- 
tion accuracy (Breiman 2001). The detection power of 
RF has been demonstrated through simulations involving 
multiple independent epistatic multi-locus interactions. 
Nonetheless, the previous studies concentrate on epis- 
tasis with marginal single-locus effects. As a result, the 
ability of RF to detect pure epistasis has not yet been 
determined. 

Unlike genetic association studies that use wrapper 
and embedded techniques, most studies involving filter 
techniques rarely consider scenarios which cover genetic 
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heterogeneity. However, one filter technique which should 
be suitable for detecting pure epistasis in the presence of 
genetic heterogeneity is an omnibus permutation test on 
ensembles of two-locus analyses or 2LOmb ( Wongseree et 
al. 2009). 2LOmb exhaustively performs two-locus analy- 
sis on case-control SNP data by x"^ tests. The best ensem- 
ble of SNP pairs is then progressively constructed where 
the statistical significance of the association between the 
ensemble and the disease is determined by a permuta- 
tion test. 2LOmb is suitable for detecting purely epistatic 
two-locus interactions and purely epistatic multi-locus 
interactions with marginal two-locus effects (Wongseree 
et al. 2009). In addition, 2LOmb has been successfully 
benchmarked against an exhaustive two-locus analysis 
technique, a set association approach (Hoh et al. 2001), 
a correlation-based feature selection technique (Hall and 
Holmes 2003) and a tuned ReliefF technique (Moore 
and White 2007). Although the study has been con- 
ducted without considering genetic heterogeneity, the 
result from an application of 2LOmb to a real case-control 
data set, derived from a genome-wide data set by focus- 
ing on SNPs within or near candidate genes, suggests 
that 2LOmb can function when genetic heterogeneity is 
present. Previously, 2LOmb identifies 11 intronic SNPs 
which exhibit no marginal single-locus effects and are 
associated with type 2 diabetes mellitus (T2D) in a UK 
population (The Wellcome Trust Case Control Consor- 
tium 2007): four SNPs in PGMl (phosphoglucomutase 
1), two SNPs in LMXIA (LIM homeobox transcription 
factor 1, alpha), two SNPs in PARK2 (parkinson pro- 
tein 2, E3 ubiquitin protein ligase (parkin)) and three 
SNPs in GYS2 (glycogen synthase 2 (liver)). The results 
also suggest that there are no interactions between genes 
(Wongseree et al. 2009). Obviously, this finding signi- 
fies the power of 2LOmb to detect genetic heterogene- 
ity. Nevertheless, a thorough investigation by simulations 
is still required. In addition, the possibility of apply- 
ing 2LOmb to a genome-wide data set also needs to be 
explored. 

In this article, the ability of 2LOmb to detect pure epis- 
tasis in the presence of genetic heterogeneity is demon- 
strated. 2LOmb is benchmarked against MDR and RF 
in various simulation scenarios generated by varying the 
number of available SNPs, number of causative SNPs and 
ratio of case samples in which the disease status is gov- 
erned by different purely epistatic interaction models. The 
statistical power of 2LOmb to directly identify the num- 
ber of independent interactions in simulated data from 
its output is subsequently evaluated. The application of 
2LOmb to a genome- wide type 1 diabetes mellitus (TID) 
data set is also included. In this study, the genome-wide 
TID data set is chosen instead of the T2D data set because 
2LOmb does not detect any purely epistatic interactions 
in the T2D data set. 



Results and discussion 

Testing with small-scaled simulated data 

2LOmb is benchmarked against MDR and RF in a simu- 
lation trial involving both pure epistasis and genetic het- 
erogeneity. An output from an efficient algorithm should 
contain a low number of SNPs and a high number of 
correctly-identified causative SNPs. These two measures 
on the number of SNPs are the performance indicators. 
Each simulated data set contains 20 or 1,000 unlinked 
SNPs in which two independent purely epistatic inter- 
actions are present. Each interaction is based on one 
of the models investigated by Wongseree et al. (2009) 
and is governed by two, three or four causative SNPs. 
As a result, the interesting numbers of causative SNPs 
in each data set are 4 (2&2), 5 (2&3), 6 (3&3 or 
2&4), 7 (3&4) and 8 (4&4). The allele frequencies of 
all causative SNPs are 0.5; these are dictated by the 
purely epistatic models with penetrance tables derived by 
Culverhouse et al. (2002) and Wongseree et al. (2009). On 
the other hand, the minor allele frequencies (MAFs) of 
the remaining SNPs are between 0.05 and 0.5; these con- 
form to the MAFs of SNPs targeted by the International 
HapMap Project (The International HapMap Consortium 
2005). The allele frequency setting is similar to that in the 
early study by Wongseree et al. (2009). The data set con- 
sists of balanced case-control samples of size 1,600. All 
SNPs in control samples are in Hardy-Weinberg equilib- 
rium. The case samples are drawn from two independent 
groups of affected individuals where the disease status of 
each individual from the same group is the result of the 
same purely epistatic interaction. This leads to the pres- 
ence of genetic heterogeneity. The interesting ratios of 
case samples from two affected groups are 1:1 and 1:3. 
The genotype distribution of causative SNPs that produce 
an independent interaction follows the purely epistatic 
model, leading to the heritability of 0.01. Thirty indepen- 
dent data sets for each simulation setting are generated by 
genomeSIM (Dudek et al. 2006). Since the same simulated 
data sets are used during the benchmarking, a paired t- 
test can be applied to assess the significance of difference 
in algorithm performance. 

The results from the problems with 20 and 1,000 SNPs 
in data are shown in Figures 1 and 2, respectively. It can be 
seen from Figure 1 that MDR fails to detect some causative 
SNPs. This suggests that the detection capability of MDR 
is lower than that of both RF and 2LOmb. Since it is highly 
unlikely that the MDR performance can improve when 
the number of available SNPs increases, the MDR simula- 
tion for the problem with 1,000 SNPs is not carried out. 
As a result, the MDR simulation is limited to the problem 
with 20 SNPs, which is similar to the study by Edwards 
et al. (2009). In Figures 1 and 2, the parameter setting of 
1:3 for the ratio of case samples from two affected groups 
leads to two sets of results if the numbers of causative 
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Figure 1 Performance of MDR, RF and 2L0mb in the problem 
with 20 SNPs. The results are averaged over 30 independent 
simulations. MDR explores only models that do not contain more 
than 1 0 SNPs. The MDR output contains the most parsimonious SNP 
combination that yields the maximum prediction accuracy. The 
number of trees in RF is set to 1 00. The RF output consists of 
top-ranked SNPs, which are SNPs with variable importance in the top 
five percentiles of a normal distribution (StrobI et al. 2009). 
Association detection is declared for 2L0mb if the global p-value 
used as the detection indicator in its result is less than 0.05. The 
results from MDR, RF and 2L0mb are displayed using red diamond, 
blue triangle and black square markers, respectively. In each chart, the 
meeting point between two dotted lines denotes the graphical 
location representing ideal performance of the algorithm. Ideally, the 
algorithm should report only the causative SNPs in its output. In other 
words, both number of output SNPs and number of 
correctly-identified causative SNPs should be equal to the number of 
causative SNPs. The charts on which the red diamond markers are 
invisible denote the situations in which the performance of MDR and 
2L0mb is similar. 



SNPs responsible for the two independent interactions are 
not equal The first set of results is obtained when the 
low-order interaction is responsible for the affected status 
of individuals from the small-proportion group. On the 
other hand, the second set of results is obtained when the 
low-order interaction is responsible for the affected status 
of individuals from the large-proportion group. 2LOmb 
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Figure 2 Performance of RF and 2L0mb in the problem with 
1,000 SNPs. The number of trees in RF is set to 1,000. The 
explanation for how the results are obtained and displayed is the 
same as that given in Figure 1 . The charts in this figure are displayed 
using a coarser scale than the charts in Figure 1 . 



significantly outperforms MDR and RF in terms of the low 
number of output SNPs, the high number of correctly- 
identified causative SNPs or both in the problems with 
20 and 1,000 SNPs (a paired /:-test on 15 x 30 = 450 
benchmark results for each problem yields a /7-value < 
0.05). The statistical power analysis also reveals that the 
benchmark trial with 30 independent data sets for each 
simulation setting is sufficient for an accurate evaluation 
of the overall algorithm performance (power > 0.95 for 
a type I error rate of 0.05). The simulation results can be 
further interpreted as follows. 

MDR functions by attempting to identify a SNP com- 
bination which leads to the maximum prediction accu- 
racy. In the presence of genetic heterogeneity, multiple 
SNP combinations are required where each combination 
is needed for the correct class prediction of a portion 
of case-control samples. If the proportions of samples 
at which their class labels can be predicted by different 
SNP combinations are equal, then each causative SNP 
contributes equally towards achieving high prediction 
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accuracy. Subsequently, MDR is able to detect all or 
almost all causative SNPs. On the other hand, if the pro- 
portions of samples are not equal, then the attained pre- 
diction accuracy depends more on the ability to classify 
samples that occupy the large proportion. In other words, 
the inclusion of SNP combination necessary for the iden- 
tification of class labels of samples that occupy the small 
proportion does not lead to an improvement of prediction 
accuracy. As a result, MDR fails to uncover some causative 
SNPs when the ratio of case samples from two affected 
groups is 1:3. 

RF identifies causative SNPs by permuting the genotype 
of the interesting SNP and monitoring how it affects the 
prediction accuracy. It is aimed that the reduction in pre- 
diction accuracy as a result of the genotypic change of the 
causative SNP is more prominent than that of other SNPs. 
Although this is an efficient strategy, RF also selects erro- 
neous SNPs as causative SNPs. This is observed from the 
number of output SNPs reported by RF which is greater 
than the number of correctly-identified causative SNPs. 
The number of erroneous SNPs increases drastically when 
the number of available SNPs increases from 20 to 1,000. 
Moreover, the number of correctly-identified causative 
SNPs also markedly decreases. This can be explained from 
the manner at which each tree is constructed. Basically, 
the tree construction begins by assigning a SNP, which 
provides the best split, from a randomly chosen SNP set 
as the root node, creating a split according to the genotype 
and sorting samples to the appropriate descendant node. 
This process is repeated until each final descendant node 
is assigned with samples from the same class or the maxi- 
mum tree size, dictated by the number of available SNPs, 
is reached. Permuting the genotype of a SNP located at or 
near the root node produces a large effect on the predic- 
tion accuracy while permuting the genotype of a SNP in a 
descendant node produces a small effect. Since the chance 
that a causative SNP being located near or at a root node is 
small when the problem contains a large number of SNPs, 
the variable importance of causative SNPs obtained by 
the genotype permutation may not be markedly different 
from that of other SNPs. This subsequently leads to the 
degradation of the RF performance. Although the perfor- 
mance of RF can be improved by increasing the number 
of trees in the forest (Strobl and Zeileis 2008), it also leads 
to an increase in computational time. The computational 
time result for RF, which will be discussed later, provides 
evidence against the option of increasing the number of 
trees for this study. 

As mentioned earlier, 2LOmb produces the best results 
among three techniques in the benchmark trial. 2LOmb is 
capable of detecting most of causative SNPs in every sim- 
ulated data set. This performance is further strengthened 
by highly significant /^-values (2LOmb s global ^-value < 
0.0001) and the presence of common SNPs among some 



or all SNP pairs that are parts of three- and four-locus 
interactions in the 2LOmb results. Nonetheless, some 
causative SNPs are missing from the 2LOmb output. Since 
the study is carried out by varying the number of avail- 
able SNPs, the number of causative SNPs and the ratio 
of case samples from two affected groups, these param- 
eters may influence the number of correctly-identified 
causative SNPs. The parameter analysis is divided into 
two parts. The first part concentrates on the results from 
problems where the numbers of causative SNPs responsi- 
ble for two independent interactions are equal while the 
second part concentrates on those where the numbers of 
causative SNPs are not equal. The analysis is divided in 
this manner because as mentioned earlier the parame- 
ter setting of 1:3 for the ratio of case samples from two 
affected groups leads to two sets of results in only the 
second part of the analysis. From both parts, analysis 
of variance (ANOVA) reveals that all three parameters 
are the sources of variation which significantly affect the 
number of correctly-identified causative SNPs {p < 0.05). 

It is observed that the number of correctly-identified 
causative SNPs decreases when the number of available 
SNPs is large. This is to be expected because the Bon- 
ferroni correction factor is a quadratic function of the 
number of available SNPs. An increase in the Bonferroni 
correction factor leads to an increase in the Bonferroni- 
corrected x^'s p-vdlue. If there are not enough sam- 
ples for the two-locus analysis to produce a sufficiently 
low Bonferroni-corrected x^'s p-vslue, some causative 
SNPs may be excluded from the output ensemble. This 
is highly evident when the ratio of case samples from 
two affected groups is 1:3. The variation in the ratio 
of case samples also leads to a change in the number 
of correctly-identified causative SNPs. The magnitudes 
of Bonferroni-corrected s p-y slues for causative SNP 
pairs are similar when the numbers of case samples from 
two affected groups are equal. All causative SNPs can gen- 
erally be identified in this scenario. On the other hand, the 
Bonferroni-corrected x'^'s /^-values for SNP pairs respon- 
sible for the affected status of individuals from the small- 
proportion group are higher than those for SNP pairs 
responsible for the affected status of individuals from 
the large-proportion group. As a result, the exclusion of 
causative SNP pairs with insufficiently low Bonferroni- 
corrected x^ s p'Ydlues from the output ensemble leads to 
a decrease in the number of correctly-identified causative 
SNPs. 

In contrast to the first two parameters, an increase in 
the number of causative SNPs leads to an increase in 
the number of correctly-identified causative SNPs. This 
phenomenon can be explained as follows. To identify a 
multi-locus interaction, a number of SNP pairs must be 
included in the output ensemble. For instance, an ensem- 
ble of three SNP pairs namely (SNPl, SNP2), (SNP2, 
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SNP3) and (SNPl, SNP3) leads to the identification of a 
three-locus interaction between SNPl, SNP2 and SNP3. 
However, only two-out-of-three SNP pairs are necessary 
for the correct interaction identification. Similarly, only 
three-out-of-six possible SNP pairs are necessary for the 
correct identification of a four-locus interaction. In other 
words, the number of redundant SNP pairs increases as 
the order of interaction increases. Hence, the number of 
correctly-identified causative SNPs increases when there 
are more redundant SNP pairs, which can be omitted from 
the output ensemble. 

In addition to the superiority in terms of the number 
of output SNPs and the number of correctly-identified 
causative SNPs, the computational time for 2LOmb 
analysis is tractable. The order of growth in 2LOmb com- 
putational time is 0(mn^) where m is the sample size and 
n is the number of available SNPs. The computational 
time for RF analysis is also tractable. However, the order 
of growth in RF computational time is 0(mlog(m)yfnf) 
where/ is the number of trees, signifying that the required 
computational time also depends on the algorithm setting 
(Guyon and Elisseef 2006). In contrast, the tractability of 
MDR depends on the maximum size of explored mod- 
els. If MDR explores all possible SNP combinations, the 
order of growth in MDR computational time is 0(m2^), 
which makes the computational time becomes intractable. 



On the other hand, if MDR explores only models that do 
not contain more than ris SNPs where ris < n, the order 
of growth in MDR computational time is 0{mn^^), which 
means that the computational time is tractable. Since the 
computational time required by 2LOmb, RF and MDR 
with the latter setting is all tractable, the comparison of 
computational time is hence carried out. MDR explores 
only models that do not cover more than 10 SNPs in the 
20-SNP data sets. An MDR permutation test is also omit- 
ted because it requires large computational efforts and 
is only performed to assess the probability that the null 
hypothesis of no association is true. The summary of com- 
putational time required by all three techniques is given in 
Table 1. RF uses lesser computational time than 2LOmb 
while MDR uses more computational time than 2LOmb 
to analyse 20-SNP data sets. However, the computational 
time required by RF to analyse 1,000-SNP data sets is 
greater than that required by 2LOmb. In addition, by lim- 
iting the MDR analysis to the exploration of models that 
do not cover more than 10 SNPs, it is estimated using the 
present MDR result that the computational time required 
by MDR to analyse a 1,000-SNP data set is 2.75 x 10^^ 
seconds. The present MDR result also suggests that the 
computational time required by MDR to perform a per- 
mutation test using 1,000 permutation replicates on a 
20-SNP data set is 6.37 x 10^ seconds. The estimation 



Table 1 Computational time required by MDR, RF and 2L0mb to analyse small-scaled simulated data sets with different 
numbers of available SNPs, different numbers of causative SNPs and different ratios of case samples from two affected 
groups 



Number of 


Ratio 






Computational time (sec) 






causative 


of case 


IVIDR 




RF 




2L0mb 


SNPs 


samples 


20 SNPs 


20 SNPs 


1,000 SNPs 


20 SNPs 


1,000 SNPs 


2&2 


1:3 


6,505 


2 


539 


5 


24 




1:1 


6,434 


2 


529 


6 


23 


3&3 


1:3 


6,573 


2 


529 


13 


32 




1:1 


6,611 


2 


531 


14 


32 


4&4 


1:3 


6,372 


2 


534 


32 


45 




1:1 


6,528 


2 


538 


27 


46 


2&3 


1:3 


6,637 


2 


529 


12 


32 




1:1 


6,644 


3 


527 


10 


30 




3:1 


6,776 


2 


528 


10 


28 


2&4 


1:3 


6,513 


2 


525 


16 


35 




1:1 


6,637 


2 


528 


16 


35 




3:1 


6,599 


2 


528 


18 


34 


3&4 


1:3 


6,369 


2 


526 


22 


38 




1:1 


6,410 


2 


530 


25 


45 




3:1 


6,435 


2 


528 


22 


38 



The simulation is carried out on a computer server. The computer server is equipped with a Xeon 2.66 GHz quad-core processor and 4GB of main memory. A CentOS 
5.5 operating system is installed on the computer server. The computational time is collected from the processing of multiple independent data sets for each 
simulation setting. The displayed time is the maximum time required by each algorithm to analyse one data set. 
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of computational time conforms to the results from early 
reports (Edwards et al. 2009; Pattin and Moore 2008; 
Ritchie et al. 2001; Wongseree et al. 2009). This means 
that a direct application of MDR and RF used in this study 
(see Methods for details) to larger data sets in which all 
SNPs exhibit no marginal single-locus effects is certainly 
impractical Overall, 2LOmb outperforms MDR and RF in 
this study. There are many attribute selection techniques 
that have been successfully applied to genetic association 
studies (Heidema et al 2006; Motsinger et al 2007; Van 
Steen 2012). It would be interesting to benchmark 2LOmb 
against other techniques that can also be applied to data 
containing pure epistasis (Culverhouse 2012; Jiang et al 
2011b; Zhang and Liu 2007) and genetic heterogeneity 
(Culverhouse 2012). 

Another advantage of using 2LOmb for detecting pure 
epistasis in the presence of genetic heterogeneity is the 
ability to identify the number of independent interactions. 
This is possible because 2LOmb reports its output in the 
form of an ensemble of SNP pairs. If there are common 
SNPs between pairs, then the detection of a multi-locus 
interaction is declared. On the other hand, the absence 
of common SNPs between pairs signifies that the interac- 
tions are independent. For example, an ensemble that con- 
tains SNP pairs (SNPl, SNP2), (SNPS, SNP4) and (SNP4, 
SNPS) indicates the presence of genetic heterogeneity in 
which a two-locus interaction between SNPl and SNP2 
and a three-locus interaction between SNPS, SNP4 and 
SNPS are independently responsible for the disease status 
of each individual. Obviously, it is impossible to directly 
identify the number of independent interactions from 
the MDR and RF results because both techniques report 
their outputs in the form of a set of SNPs and not a set 
of SNP pairs. To demonstrate this capability of 2LOmb, 
the previously described simulation is extended where 
the number of independent data sets for each simulation 
setting increases from 30 to 100. The portions of inde- 
pendent data sets in which 2LOmb can identify at least 
one interaction and both interactions in the data sets are 
obtained for the calculation of statistical power. Detec- 
tion of one interaction is declared if 2LOmb correctly 
identifies at least two interacting causative SNPs respon- 
sible for the affected status of individuals from only one 
case group. On the other hand, detection of two inter- 
actions is declared if 2LOmb correctly identifies at least 
four interacting causative SNPs in the form of two SNP 
pairs without a common SNP among the pairs. In addi- 
tion, each SNP pair must be responsible for the affected 
status of individuals from a different case group. The sta- 
tistical power to detect genetic heterogeneity summarised 
in Table 2 indicates that 2LOmb can identify both inter- 
actions in nearly all 20-SNP data sets. However, a loss of 
statistical power to detect both interactions is observed 
when the number of available SNPs increases from 20 to 



1,000. In particular, this occurs when the ratio of case 
samples from two affected groups is 1:S and a two-locus 
interaction is responsible for the affected status of indi- 
viduals from the small-proportion group. This conforms 
to the early observation regarding the effects of increasing 
the number of available SNPs and increasing the number 
of causative SNPs on the number of correctly-identified 
causative SNPs. In brief, the Bonferroni correction factor 
increases when the number of available SNPs increases. If 
the Bonferroni-corrected s /7-value for a causative SNP 
pair is not low enough, this pair would be excluded from 
the output ensemble. Subsequently, a failure to identify 
the causative SNP pair that is solely responsible for the 
affected status of individuals from the small-proportion 
group leads to the reduction in statistical power to detect 
both interactions. 

Testing with large-scaled simulated data 

In this part of the study, each simulated data set contains 
10,000 or 100,000 unlinked SNPs where two independent 
purely epistatic two-locus interactions are present. Only 
the setting of two independent two-locus interactions is 
considered because the early simulation results given in 
Table 2 indicate that this scenario is the most difficult one 
when the number of available SNPs is large. The allele fre- 
quencies of all causative SNPs are O.S while the MAFs of 
the remaining SNPs are between O.OS and O.S. The data 
set consists of balanced case-control samples of size 1,600, 
S,200 or 6,400. All SNPs in control samples are in Hardy- 
Weinberg equilibrium. The case samples are drawn from 
two independent groups of affected individuals where the 
ratios of samples from two affected groups are 1:1 and 
1:S. The genotype distribution of interacting causative 
SNPs follows the purely epistatic model which gives the 
heritability of 0.01. One hundred independent data sets 
for each simulation setting are generated by genomeSIM 
for the evaluation of statistical power to detect genetic 
heterogeneity. 

The summary of statistical power to detect genetic het- 
erogeneity in Table S indicates that the ability to detect 
both interactions is highest when the ratio of case sam- 
ples from two affected groups is 1:1 and is lowest when 
the sample size is 1,600 and the ratio of case samples 
from two affected groups is 1:3. This conforms to the 
early observation where a similar phenomenon is detected 
when the number of available SNPs is 1,000. However, 
once the sample size is doubled and quadrupled, the abil- 
ity to detect both interactions increases significantly. Both 
interactions can be detected in almost all data sets con- 
taining 10,000 or 100,000 SNPs and 3,200 samples while 
both interactions can be detected in all data sets contain- 
ing 10,000 or 100,000 SNPs and 6,400 samples. Obviously, 
an increase in sample size causes an increase in the 
test statistic during the two-locus analysis of causative 
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Table 2 Statistical power of 2L0mb to detect genetic heterogeneity in small-scaled simulated data sets with different 
numbers of available SNPs, different numbers of causative SNPs and different ratios of case samples from two affected 
groups 

Statistical power 
20 SNPs 1,000 SNPs 



Number of 

causative 
SNPs 


Ratio 
of case 
samples 


At least 
one interaction 
detected 


Two 
interactions 
detected 


At least 
one interaction 
detected 


Two 
interactions 
detected 


2&2 


1:3 


1.00 


0.95 


1.00 


0.55 




1:1 


1.00 


1.00 


1.00 


1.00 


3&3 


1:3 


1.00 


1.00 


1.00 


0.88 




1:1 


1.00 


1.00 


1.00 


1.00 


4&4 


1:3 


1.00 


1.00 


1.00 


1.00 




1:1 


1.00 


1.00 


1.00 


1.00 


2&3 


1:3 


1.00 


0.93 


1.00 


0.60 




1:1 


1.00 


1.00 


1.00 


1.00 




3:1 


1.00 


0.98 


1.00 


0.94 


2&4 


1:3 


1.00 


0.94 


1.00 


0.63 




1:1 


1.00 


1.00 


1.00 


1.00 




3:1 


1.00 


1.00 


1.00 


0.99 


3&4 


1:3 


1.00 


1.00 


1.00 


0.88 




1:1 


1.00 


1.00 


1.00 


1.00 




3:1 


1.00 


1.00 


1.00 


0.97 



Each data set consists of balanced case-control samples of size 1 ,600. The results indicate that 2L0mb detects at least one interaction in every data set (global p-value 
< 0.0001). 



SNPs. This suggests that increasing the sample size leads 
to a lower Bonferroni-corrected x'^'s p-vdlue for the SNP 
pair responsible for the affected status of individuals from 
the small-proportion group. Subsequently, the chance this 
SNP pair being included in the output ensemble increases. 

Bonferroni correction during the two-locus analysis 
plays an important role in keeping the number of output 
SNPs reported by 2LOmb close to the number of causative 
SNPs (Wongseree et al. 2009). However, the overly con- 
servative nature of Bonferroni correction when the num- 
ber of statistical tests is large (Jiang et al. 2011a) also 
leads to the aforementioned limitation in 2LOmb's ability 
to detect both independent interactions when the ratio 
of case of samples from two affected groups is 1:3. 
Although increasing the sample size is a possible solution, 
other multiple testing correction techniques can be used 
instead of the Bonferroni correction to tackle this prob- 
lem. For instance, false discovery rate (FDR) analysis is a 
strong candidate and is proven to be appropriate for DNA 
microarray data analysis (Storey and Tibshirani 2003) and 
genome-wide association studies (The Diabetes Genetics 
Replication and Meta-analysis Consortium 2012). Further 
studies are required to determine the effect of replacing 
the Bonferroni correction in the two-locus analysis within 
2LOmb with the FDR analysis. 



The computational time summarised in Table 4 indi- 
cates that the computational time for the large-scaled 
simulation is a linear function of sample size. This is to 
be expected because the construction of a 2 x 9 con- 
tingency table for each two-locus analysis requires the 
assignment of samples to the appropriate cells in the 
table, which is a linear-time operation. On the other 
hand, the computational time is a quadratic function of 
the number of available SNPs. Since the basic operation 
of 2LOmb is the two-locus analysis, 2LOmb can tackle 
large-scaled problems with fixed sample size and var- 
ied number of SNPs in quadratic time (Wongseree et al. 
2009). Overall, the result agrees with the order of growth 
in computational time discussed in the small-scaled sim- 
ulation section. Based on the computational time given 
in Table 4, it is estimated that 2LOmb requires 3.14 x 
10^ seconds (87.2 hours) of computational time to com- 
plete the analysis of a data set containing 500,000 SNPs 
and 6,400 samples. This suggests that the computational 
time of 2LOmb is tractable for genome-wide association 
studies. 

Analysis of type 1 diabetes mellitus data 

The presence of pure epistasis and genetic heterogeneity 
in a TID data set is identified using 2LOmb. The data set, 
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Table 3 Statistical power of 2L0mb to detect genetic heterogeneity in large-scaled simulated data sets with different 
numbers of available SNPs, different sample sizes and different ratios of case samples from two affected groups where 
the affected status is governed by a two-locus interaction 

Statistical power 



10,000 SNPs 100,000 SNPs 





Ratio 


At least 


Two 


At least 


Two 


Sample 


of case 


one interaction 


interactions 


one interaction 


interactions 


size 


samples 


detected 


detected 


detected 


detected 


1,600 


1:3 


1.00 


0.30 


1.00 


0.13 




1:1 


1.00 


1.00 


1.00 


1.00 


3,200 


1:3 


1.00 


0.98 


1.00 


0.92 




1:1 


1.00 


1.00 


1.00 


1.00 


6,400 


1:3 


1.00 


1.00 


1.00 


1.00 




1:1 


1.00 


1.00 


1.00 


1.00 



The results indicate that 2L0mb detects at least one interaction in every data set (global p-value < 0.0001). 



which is collected and screened by the Wellcome Trust 
Case Control Consortium (WTCCC), consists of 1,963 
case samples and 2,938 control samples. The case sam- 
ples are collected from affected individuals in the UK 
while the control samples are the results of the merging 
between 1,458 samples from the UK blood services and 
1,480 samples from the 1958 British birth cohort. The data 
set contains 469,557 SNPs, which are genotyped through 
the Affymetrix GeneChip 500K Mapping Array Set and 
pass the WTCCC quality control (The Wellcome Trust 
Case Control Consortium 2007). The SNP set is primar- 
ily reduced by screening for SNPs within or near genes 

Table 4 Computational time required by 2L0mb to analyse 
large-scaled simulated data sets with different numbers of 
available SNPs, different sample sizes and different ratios 
of case samples from two affected groups where the 
affected status is governed by a two-locus interaction 

Sample Ratio of Computational time (sec) 



size case samples 10,000 SNPs 100,000 SNPs 



1,600 


1:3 


34 


3,106 




1:1 


34 


3,116 


3,200 


1:3 


68 


6,227 




1:1 


68 


6,256 


6,400 


1:3 


135 


1 2,503 




1:1 


136 


1 2,560 



The simulation is carried out on a computer system witli a grapliics processing 
unit. The parallelism of the graphics processing unit is exploited to speed up the 
computation. The computer system is equipped with an AMD 2.8 GHz quad-core 
processor, 4GB of main memory and an NVIDIA GeForce GTX 285 graphics 
processing unit. The graphics processing unit contains 240 streaming processors 
sharing 1 GB of GDDR3 memory. Each streaming processor has a clock rate of 
1 .48 GHz. An Ubuntu 9.1 0 operating system is installed on the computer system. 
The computational time is collected from the processing of multiple 
independent data sets for each simulation setting. The displayed time is the 
maximum time required to analyse one data set. 



(Herold et al 2009; Ritchie 2011) according to NCBI build 
36.3 (dbSNP bl29) coordinates. SNPs that are near a gene 
are located within 2,000 bases upstream of the start site 
or 500 bases downstream of the termination site for tran- 
scription. The SNP set is further reduced by removing 
SNPs that exhibit marginal single-locus effects or have 
MAFs below 0.1. SNPs that the genotype distribution 
within control samples departs from Hardy- Weinberg 
equilibrium are also discarded. The final SNP set con- 
tains 95,991 SNPs with no marginal single-locus effects 
(uncorrected x^'s p-vslue > 0.05) from 12,146 genes. 

The analysis of the reduced TID data set by 2LOmb 
takes 8,862 seconds (2.46 hours) of computational time on 
the computer system with a graphics processing unit (see 
Table 4 for detailed computer specification). The possi- 
ble genetic association is detected from 12 SNPs located 
within or near five genes (global p-vdlue < 0.0001). Details 
of these SNPs, the SNP pairs that exhibit marginal two- 
locus effects and the identified genes are given in Table 5. 
Linkage disequilibrium (LD) analysis is subsequently per- 
formed using JLIN (Carter et al. 2006) and the LD patterns 
are shown in Figure 3. All SNPs within or near the same 
gene are in LD due to high values of and r^. This 
is most likely being the cause of the identification of 
multiple SNPs from the same gene. On the other hand, 
SNPs in each pair that contains SNPs from different genes 
(SNP pairs 2-18) are not in LD due to low values of 
and P'. There are several subsets of these SNP pairs in 
which each subset contains three SNP pairs with com- 
mon SNPs between pairs. One example is {(SNPl, SNP5), 
(SNPl, SNP7), (SNPl, SNP9)}. Consequently, the detec- 
tion of these 17 SNP pairs indicates that there is a four- 
locus interaction between MZiC2i (mucin 21, cell surface 
associated), MUC22 (mucin 22), PSORSICI (psoriasis 
susceptibility 1 candidate 1) and TCF19 (transcription 
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Table 5 2L0mb identifies 1 2 SNPs, which are located within or near five genes, from the reduced T1 D data set 



Gene 


Chromosome 
and location 


SNP 
no. 


SNP 
















SNP pair in the ensemble 














1 2 


3 


4 


5 


6 


7 


8 


9 10 11 12 13 14 


15 


16 


17 


18 


19 


20 


MUC21 


6p21.32 


1 


rs2844678 


• 


• 








• 


• 


• 


















2 


rs2523929 






• 


• 








• • • 


















3 


rs2530699 










• 






• • 


• 












MUC22 


6p21.33 


4 


rs9262546 


• 


















• 


• 


• 










5 


rs6933349 


• • 




• 




• 
























6 


rs47 13423 




• 




• 






















PSORSICI 


6p21.3 


7 


rs9263715 












• 




• • 




• 














8 


rs9263716 














• 


• • 






• 








TCF19 


6p21.3 


9 


rs9263794 
















• • 


• 






• 






ATADl 


10q23.31 


10 


rsl 2775041 


























• 








11 


rsl 25731 60 


























• 


• 






12 


rsl 2781 171 




























• 



Twenty SNP pairs are present in the ensemble. A pair of dots in tlie same column denotes a SNP pair. 



factor 19). In contrast, there are no interactions between 
ATADl (ATPase family, AAA domain containing 1) and 
the other genes due to the absence of a SNP pair con- 
taining a SNP from ATADl and a SNP from any of the 
remaining four genes. The detection of three Iinl<ed SNPs 
within ATADl is beUeved to be the result of haplotype 
effects (Epstein and Satten 2003). Altogether, this clearly 
signifies the presence of pure epistasis and genetic het- 
erogeneity. In real data analysis, the detection of a SNP 
pair that associates with the disease is insufficient to 
claim the presence of pure epistasis. If the SNP pair con- 
sists of two unlinked SNPs, then the detection of pure 
epistasis can be declared. Otherwise, the detection is the 
result of LD between SNPs. Since 2LOmb analysis cannot 
solely distinguish genetic association due to pure epistasis 
from genetic association due to LD, it is crucial to always 
perform additional LD analysis. 

The first four genes identified by 2LOmb, namely 
MUC21, MUC22, PSORSICI and TCF19, are located on 
the major histocompatibility complex (MHC). MHC is 
a genomic region in which a mouse model of human 
complex diseases suggests the presence of TID suscepti- 
bility genes (Cordell et al. 2001). The four genes are also 
located between DDRl (discoidin domain receptor tyro- 
sine kinase 1) and HLA-DQAl (major histocompatibility 
complex, class II, DQ alpha 1), which is the region where 
the DR3-DQ2 ancestral haplotype 18.2 (AH18.2) is proven 
to be highly conserved and likely to carry susceptibility 
alleles for TID in a Spanish population (Santiago et al. 
2009). This implies that the detection of a four-locus inter- 
action between these four genes conforms to the evidence 
from early genetic association studies of TID. On the 
other hand, there are no early reports regarding the asso- 
ciation between ATADl polymorphisms and TID. ATADl 



is among many candidate genes for the association studies 
of Parkinsons disease. Nonetheless, there is little infor- 
mation about pathways that include ATADl (Moran and 
Graeber 2008). Hence, it is impossible to explain the asso- 
ciation between ATADl polymorphisms and TID at this 
point. 

This study produces evidence of association between 12 
SNPs within or near MUC22, PSORSICI, TCF19 

2ind ATADl, and TID in a UK population. Although there 
are other independent genome-wide TID data sets, the 
association detection within these data sets using the pre- 
sented methodology has never been attempted. Basically, 
the methodology employed in most genome-wide asso- 
ciation studies is based on single-locus analysis (Cooper 
et al. 2008; The Wellcome Trust Case Control Consor- 
tium 2007). Since each SNP explored in the reduced TID 
data set exhibits no marginal single-locus effect, the most 
direct approach for replicating the association results pre- 
sented in this article is to perform the same association 
detection on these independent data sets. This would cer- 
tainly help to gain a further insight into the genetics of 
TID. 

Conclusions 

In this article, the detection of pure epistasis (Culver- 
house et al. 2002) in the presence of genetic heterogeneity 
is investigated. The study focuses on the capability to 
detect two independent interactions that influence the 
development of the same complex disease. Each inter- 
action can be either a purely epistatic two-locus inter- 
action or a purely epistatic multi-locus interaction in 
which the causative SNPs exhibit no marginal single- 
locus effects. The candidate techniques for the detec- 
tion benchmarking are MDR (Ritchie et al. 2001), RE 
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Figure 3 LD patterns of SNPs within or near MUC2h MUC22, 
PSORSlCh TCF19 andATADh LD is explained by D' displayed in the 
upper triangle and displayed in the lower triangle. Dark colours 
indicate high values while pale colours indicate low values. Distances 
between SNPs are given in terms of the number of base pairs. SNPl = 
rs2844678, SNP2 = rs2523929, SNP3 = rs2530699, SNP4 = rs9262546, 
SNPS = rs6933349, SNP6 = rs471 3423, SNP7 = rs926371 5, SNPS = 
rs926371 6, SNP9 = rs9263794, SNPl 0 = rsl 2775041 , SNPl 1 = 
rsl 25731 60 and SNPl 2 = rsl 2781 171. 



(Breiman 2001) and 2LOmb (Wongseree et al 2009). The 
results from various simulation scenarios indicate that 
2LOmb outperforms MDR and RF in terms of a low 
number of output SNPs and a high number of correctly- 
identified causative SNPs. These scenarios are created 
by varying the number of available SNPs in data, the 
number of causative SNPs and the ratio of case sam- 
ples from two affected groups. ANOVA reveals that all 
three simulation parameters influence the number of 
correctly-identified causative SNPs in the 2LOmb output. 
In addition to the superiority in the detection perfor- 
mance, 2LOmb is also capable of identifying the number 
of independent interactions. This is achieved through 
the identification of common SNPs among SNP pairs 
in the ensemble. The results indicate that 2LOmb is 
able to identify the presence of independent interac- 
tions even though the number of available SNPs reaches 
100,000. Moreover, this is achieved in tractable compu- 
tational time, which makes 2LOmb suitable for use in 
genome-wide association studies. 2LOmb is subsequently 



applied to a TID data set, which contains 1,963 case sam- 
ples and 2,938 control samples and is collected from a 
UK population (The Wellcome Trust Case Control Con- 
sortium 2007). The genome-wide data set is primarily 
screened for SNPs that locate within or near genes. The 
data set is further reduced by removing SNPs that exhibit 
marginal single-locus effects or have MAFs below 0.1. The 
final data set contains 95,991 SNPs from 12,146 genes. 
2LOmb identifies 12 SNPs that are associated with the 
disease. These SNPs are located within or near MUC21, 
MUC22, PSORSICI, TCF19 and ATADi. 2LOmb and LD 
analyses indicate that there is a four-locus interaction 
between MUC21, MUC22, PSORSICI and TCF19 while 
SNPs from ATADl are independently associated with the 
disease. This signifies the presence of both pure epis- 
tasis and genetic heterogeneity. The evidence of genetic 
association for these five genes provides an alternative 
explanation for the aetiology of TID in the UK popula- 
tion. It also confirms that SNPs which exhibit no marginal 
single-locus effects from a genome-wide data set can be 
useful for genetic association studies (Wongseree et al. 
2009). 

Methods 

Purely epistatic model 

A purely epistatic model is first defined by Culverhouse 
et al. (2002). The model describes an interaction between 
unlinked SNPs which leads to an epistatic effect while 
each interacting SNP exhibits no marginal single-locus 
effect. As a result, it is impossible to screen for SNPs con- 
tributing to pure epistasis by by single-locus tests for 
allelic and genotypic association. However, pure epistasis 
can be detected by multi-locus analysis. In this study, each 
model contains two, three or four causative SNPs. The 
purely epistatic three- and four-locus interaction mod- 
els also exhibit marginal two-locus effects. All models 
yield the heritability of 0.01, which implies that genetic 
factors partially contribute towards disease susceptibility. 
The penetrance tables, which define the probability that 
an individual with a specific genotype has the disease, 
for purely epistatic two-, three- and four-locus interac- 
tion models used throughout the simulations are given 
in Tables 6, 7 and 8, respectively. Detailed derivation of 
these models is given in Culverhouse et al. (2002) and 
Wongseree et al. (2009). 

genomeSIM 

genomeSIM is a software package for simulating case- 
control data in genetic association studies (Dudek et 
al. 2006). genomeSIM takes penetrance-based models as 
inputs necessary for dictating the case/control status of 
each sample. A case-control data set can be generated 
by a population-based simulation or a probability-based 
simulation. A population of genotype strings is initialised 
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Table 6 Two-locus penetrances that lead to the herltablllty 
of 0.01 



Genotype 




Penetrance of genotype 




BB 


Bb 


bb 


AA 


0 


0 


4K 


Aa 


0 


2K 


0 


00 


4K 


0 


0 



AA and BB denote homozygous wild-type genotypes. Aa and Bb denote 
heterozygous genotypes, aa and bb denote homozygous variant genotypes. All 
allele frequencies are equal (p/\ =pe = 0.5). /C= 1/201. 



according to the allele frequency of each SNP in the 
population-based simulation. Successive generations are 
subsequently created through a forward-time simulation 
by crossing the genotype strings within each generation. 
This is pursued until the predefined number of gener- 
ations is reached. On the other hand, genotype strings 
are incrementally created until the predefined numbers of 
case and control samples are obtained in the probability- 
based simulation. In this study, the probability-based 
simulation is employed to generate all case-control data 
sets. genomeSIM is available upon request to the Ritchie 
Lab, Center for System Genomics, Pennsylvania State 
University (Ritchie Lab 2013). 

Multifactor dimensionality reduction 

MDR is a wrapper technique which is capable of iden- 
tifying causative SNPs that are associated with a dis- 
ease from case-control data (Ritchie et al. 2001). MDR 
functions by attempting to identify the best SNP combina- 
tion that yields the highest prediction accuracy. The pre- 
diction accuracy is calculated by means of a 10-fold cross- 
validation. During the cross-validation, the data set is 
randomly divided into 10 folds of combined case-control 
samples in which 9 folds of samples are used to construct 
the prediction model while the remaining fold is used 
to test the model. The process of prediction model con- 
struction and testing is then repeated 10 times where for 
each time a different sample fold is chosen as the testing 
fold. The prediction model embedded in MDR is a multi- 
dimensional decision table with 3^^ cells when ric SNPs 
and all three possible genotypes according to each SNP 
are considered. Each cell in the decision table is filled with 



case and control samples for which their genotypes coin- 
cide with the cell labels. The ratio between the numbers of 
case and control samples dictates whether the genotype in 
each cell is a protective or disease-predisposing genotype. 
The prediction accuracy is then evaluated by counting the 
number of testing samples that their disease status can be 
correctly identified using the decision rules provided by 
the table. 

Similar to other wrapper techniques, the total num- 
ber of possible prediction models that MDR can explore 
is 2^ — 1 where n is the number of available SNPs 
in the data set. With the use of an exhaustive search, 
MDR can generally identify the best SNP combination 
that gives the highest prediction accuracy. However, the 
search for the best model can also be limited to mod- 
els that do not cover more than ris SNPs where ris < n. 
After exploring multiple prediction models with a fixed 
number of SNPs, MDR also returns an additional mea- 
sure called cross-validation consistency. Basically, each 
time that a testing fold is used to determine the accu- 
racy of the interesting prediction model, the attained 
accuracy can be compared with that from other models 
which have the same number of SNPs as the interesting 
model. The model with high cross-validation consistency 
is the one that consistently ranks the first in compar- 
ison to other models regardless of which testing fold 
being used. A model with high cross-validation consis- 
tency usually has high prediction accuracy. As a result, 
prediction accuracy remains the principal criterion for 
model selection while cross-validation consistency is only 
applied as an auxiliary criterion. If two or more SNP com- 
binations give the highest prediction accuracy and have 
equally high cross-validation consistency, the most par- 
simonious combination — the combination with the least 
number of SNPs — is the one chosen as the best SNP 
combination. 

A permutation test can subsequently be applied to 
estimate the probability that the null hypothesis of no 
association is true. Each permutation replicate is con- 
structed by randomly assigning the case/control status 
to each sample with the constraint that the numbers of 
case and control samples must remain unchanged. MDR 
is then performed on each permutation replicate to obtain 



Table 7 Three-locus penetrances that lead to the herltablllty of 0.01 

Penetrance of genotype 

CC Cc cc 

Genotype BB Bb bb BB Bb bb BB Bb bb 

AA 0 0 16/C 0 0 0 0 0 0 

Ao 0000 4/C 0000 

00 0 0 0 0 0 0 16/( 0 0 



AA, BB and CC denote homozygous wild-type genotypes. vAa, Bb and Cc denote heterozygous genotypes, aa, bb and cc denote homozygous variant genotypes. All 
allele frequencies are equal (p/\ =pe =pc = 0.5).K= 1/901. 



Setsirichok etal. SpringerPlus 201 3, 2:230 
http://www.springerplus.eom/content/2/1/230 



Page13of 17 



Table 8 Four-locus penetrances that lead to the herltablllty of 0.01 



Penetrance of genotype 
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Bb 
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0 




bb 
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0 


0 


0 


0 


0 



AA, BB, CC and DD denote homozygous wild-type genotypes. Aa, Bb, Cc and Dd denote heterozygous genotypes, aa, bb, cc and dd denote homozygous variant 
genotypes. All allele frequencies are equal {pA = Pb = Pc = Pd = 0.5). K= 1/3501. 



the best SNP combination together with its prediction 
accuracy and cross-validation consistency. The empirical 
P'Vdlue is given by the fraction of permutation replicates 
with the interesting measure larger than or equal to that 
obtained from the original data where the measure can be 
either prediction accuracy or cross-validation consistency 
(Hahn et al. 2003). MDR used in this study is publicly 
available from the Computational Genetics Laboratory, 
Dartmouth Medical School, Dartmouth College (Com- 
putational Genetics Laboratory at Dartmouth Medical 
School 2013). 

Random forest 

RF refers to a collection or ensemble of decision trees 
(Breiman 2001). Each tree in RF is constructed in a top- 
down manner. The tree construction begins at the root 
node where an attribute (SNP) is selected as the test. 
Descendants of the root node are then created accord- 
ing to the values of this attribute (genotypes of this 
SNP). Next, the (case-control) data samples are sorted to 
the appropriate descendant node. The entire process is 
repeated using the samples associated with each descen- 
dant node to select another attribute to test at that point 
in the tree. This forms a forward search for an accept- 
able decision tree in which the search never backtracks to 
reconsider earlier node choices. Since there are multiple 
trees in the forest, RF takes a majority vote from the trees 
as the class decision. Hence, the trees should be diverse 
in order for the majority- vote concept to be applicable. It 
is suggested that an attribute for each node in a tree can 
be selected according to its suitability for being used as 
the test from a small group of randomly picked attributes. 
Empirical studies indicate that an attribute group size of 
Vtotal number of attributes~| is sufficient. Consequently, 
the samples allocated to each descendant node, which is 



created after selecting the most suitable attribute as the 
test, have lesser class variety. Moreover, each tree in RF is 
allowed to grow to its maximum size. This does not lead 
to data over-fitting because the overall class decision relies 
on outcomes from multiple trees in the forest. 

Unlike MDR, a bootstrap aggregating or bagging 
approach provides a means to determine the prediction 
accuracy of RF. Given a (case-control) sample set, a boot- 
strap sample set with the size equals to original sample set 
is generated by sampling from the original sample set uni- 
formly and with replacement. It is expected that 63.2% of 
bootstrap samples are unique while the remaining samples 
are duplicates. Original samples that are absent from the 
bootstrap sample set are referred to as out-of-bag samples. 
Bootstrap samples are employed during the tree construc- 
tion while out-of-bag samples are used to evaluate the pre- 
diction accuracy. A new bootstrap sample set is generated 
for the construction of each tree. As a result, the votes are 
only counted across the trees that the sample is out-of-bag 
during the prediction accuracy evaluation. The applica- 
tion of a bootstrap aggregating approach also leads to a 
means to quantify attribute importance, which is com- 
monly referred to as variable importance. The variable 
importance is measured using a permutation approach. 
By randomly permuting the value of the attribute of inter- 
est, the correlation between the attribute and the (case- 
control) class can be determined. When the permuted 
attribute and the remaining non-permuted attributes are 
used as inputs for RF to identify the class of out-of-bag 
samples, the prediction accuracy reduces markedly if the 
attribute of interest is correlated with the class. The aver- 
age difference between the prediction accuracy obtained 
using the original attribute inputs and that obtained using 
the inputs with one permuted attribute over the trees 
is the variable importance. The standardised variable 
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importance is defined as the quotient between the vari- 
able importance and a standard error derived from the 
between-tree variance of the variable importance. In other 
words, the standardised variable importance follows a 
standard normal distribution (Random Forests 2004). An 
attribute with variable importance in the top five per- 
centiles of a normal distribution is considered to be in a 
top rank in comparison to other attributes and is hence 
correlated with the class. This decision criterion is similar 
to the one based on the extremity of variable importance 
suggested by Strobl et al. (2009). RF used in this study 
is publicly available from the Department of Statistics, 
University of California, Berkeley (Random Forests 2004). 
A review of RF for genetic association studies can be found 
in Goldstein et al. (2011). Interested readers should also 
refer to Schwarz et al. (2010) and Wei et al. (2013) for 
RF-based techniques that are computationally feasible for 
genome-wide association studies. 

Omnibus permutation test on ensembles of two-locus 
analyses 

2LOmb is a filter technique which is specifically 
designed for detecting pure epistasis in case-control data 
(Wongseree et al. 2009). 2LOmb consists of four steps as 
follows. 

Two-locus analysis 

2LOmb begins by exhaustively performing two-locus 
analysis by tests. Each test determines the differ- 
ence between the distribution of two-locus genotypes in 
case and control samples. For a case-control data set con- 
taining n SNPs, (2) two-locus analyses are attained. Sub- 
sequently, the x'^'s j:7-value from each two-locus analysis 
is adjusted by a Bonferroni correction. The Bonferroni- 
corrected x^'s p-value from each two-locus analysis is 
min((2) X uncorrected x'^'s j?-value, 1). 

Permutation test 

A permutation test is performed to test the null hypo- 
thesis Hq that the ensemble e of two-locus analyses is 
not associated with the disease. To achieve this, a scalar 
statistic is first computed for the original case-control 
data set by combining Bonferroni-corrected x^ s /^-values 
for SNP pairs through a Fishers combining function 
(—2 J2i log(/?0)- The calculation of the Fisher's test statis- 
tic is then repeated for a set of permutation replicates. 
Each permutation replicate is constructed by randomly 
permuting the case/control status of each sample, which 
leads to different Bonferroni-corrected x^ s p-vdlues and 
Fisher s test statistic. The p-vslue of the null hypothesis Hq 
is then given by 

pl^\{i:l<i<t,Tf>T^}\/t (!) 



where Tf is the Fishers test statistic calculated for the 
permutation replicate /, Tq is the Fishers test statistic 
calculated for the original case-control data set, t is the 
number of permutation replicates and | • | denotes the size 
of a set. 

Global p-value determination 

Since multiple ensembles of two-locus analyses can be 
explored, the calculation of global p-vdlue is required to 
adjust for multiple hypothesis testing. The result is the p- 
value of the global null hypothesis Hq = ni<e<£^o 
which none of E explored ensembles is associated with the 
disease. Similar to other omnibus permutation tests, the 
same set of permutation replicates that gives the raw or 
unadjusted p-vdlue for each ensemble is also used to esti- 
mate the global p-vdlue. To obtain the global p-vdlue, the 
unadjusted j:7-value for the permutation replicate / of each 
hypothesis Hq is first calculated from 

P^^W:0<j<t,jy^i,T;>Tf}\/t. (2) 

The j:?-value of the global null hypothesis Hq is then given 
by 

/^global = \{i:l<i< Upf" < P^n\/t (3) 

where pf^^^ = mirvep^i is the minimum of unadjusted 
/^-values over the explored ensembles in the permuta- 
tion replicate / and p^^^ = mine pQ is the minimum of 
raw p'Vdlues over the explored ensembles in the original 
case-control data set. 

Search for the best ensemble of two-locus analyses 

The search for the best ensemble of two-locus analy- 
ses initialises by selecting the SNP pair with the lowest 
Bonferroni-corrected x^ s p-vdlue, which is a part of result 
from the first step of algorithm. A permutation test is then 
performed for this two-locus analysis, yielding both raw 
and global /7-values because only one hypothesis has been 
explored. If the raw and global p-vdlues of this first ensem- 
ble are statistically insignificant, the search terminates and 
the null hypothesis of no association cannot be rejected. 
Otherwise, the search continues by merging the SNP pair 
with the next lowest Bonferroni-corrected x^'s j?-value to 
the current best ensemble and re-evaluating the raw and 
global /7-values. The search continues progressively in this 
manner until either an increase in the raw or global p- 
value is observed or all possible SNP pairs are included in 
the ensemble. If the search terminates prior to the inclu- 
sion of all possible SNP pairs, the best ensemble is the one 
from the previous iteration. 

In this study, the significance level (a) to determine 
whether an ensemble is associated with the disease is 0.05 
and the number of permutation replicates is 10,000, which 
is proven to be sufficient in the early study (Wongseree et 
al. 2009). 2LOmb is publicly available from its homepage 
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(Detecting Purely Epistatic Multi-locus Interactions by an 
Omnibus Permutation Test on Ensembles of Two-locus 
Analyses 2009). 

Java LINkage disequilibrium plotter 

A Java LINkage disequilibrium plotter (JLIN) is a software 
package for the illustration of linkage disequilibrium pat- 
terns (Carter et al. 2006). JLIN is used to display/)^ and 
calculated for SNPs which are associated with TID. JLIN 
is publicly available from the Centre for Genetic Epidemi- 
ology and Biostatistics, University of Western Australia 
(JLIN 2010). 
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